Introduction

Starbucks’ gargantuan $26 billion of revenue in 2019 is resounding evidence that Americans, above everything, love their coffee. They prove it time and time again, most recently during the current pandemic: some, unable to relinquish their daily ‘bucks, waited hours on end in long drive-thru lines to get their fix when coffee shops were closed for in-person operations.

It is no surprise, then, that Starbucks owns more than 8,000 stores across the US, and continues to grow every day. These sheer numbers reflect a company that is about more than just coffee, having a significant influence on US society and culture with (often controversial) initiatives such as removing religious references from holiday-themed cups. In addition, Starbucks is known for treating their employees extremely well by providing health coverage, tuition coverage, and 401(k) plans, which again reinforces their highly regarded brand.

As such, the astounding popularity of the chain in the country and the prospect of further expansion raise important questions. What can the current distribution of Starbucks stores tell us about societal factors across the country? Which factors should the brand consider when expanding into new locations? How can companies like Starbucks, which pride themselves on a positive social and environmental outlook, incorporate such values into their corporate strategy - especially in the current pandemic?

Overview of Analysis

In order to answer these questions, we sought to understand the relationship between state-level social and economic factors, such as income and unemployment, and the distribution of Starbucks locations in the US. Furthermore, we wanted to understand what company behavior and decision-making could look like with a more social-minded approach, instead of one that considers profits alone. Starbucks shops, after all, can be significant contributors to social good on a localized level by providing stable jobs that open up avenues for growth and education.

As such, in this project we conduct a consulting case study of Starbucks, analysing what insights can be gained from the current distribution of stores in the US and providing socially-minded recommendations for new locations into which the brand can expand. We start with a spatial analysis of Starbucks stores across the US, exploring the relationship between store density and the income and unemployment levels of each state. We then complement this with a text-based sentiment analysis of recent tweets mentioning Starbucks, and explore how sentiment varies by location. Finally, we use these variables to conduct a clustering analysis of US states, in order to determine the different “types” of states in which Starbucks operates, and use the resulting clusters to identify the ideal locations for the brand to build new stores.

Data

In order to conduct this analysis, we collected the following data:

  • The Starbucks Locations Worldwide dataset, available on Kaggle and provided by Starbucks Corporation, including a record for every Starbucks or subsidiary store location that was in operation during February 2017.
  • The US Household Income Statistics dataset, available on Kaggle and provided by the Golden Oak Research group, containing information on US cities’ average income.
  • The US Unemployment Rate by County dataset, available on Kaggle and provided by the US Department of Labor’s Bureau of Labor Statistics, containing unemployment and population data at the US county level from 1990-2016.
  • The most recent 18,000 tweets containing the string “Starbucks” or “starbucks” in their text, scraped using the rtweet package to interface with the Twitter API.

Check the links in the bullets above for access to each of the data sources!

Distribution of Starbucks Across the US

The first step of our analysis was to understand how Starbucks stores are distributed across the US:

Looking at the plot above, which displays the distribution and density of Starbucks locations across the US, we can see that stores are heavily clustered along the coasts and less prominent in the Rocky Mountain and Midwestern states. This suggests a correlation between income and store location, as stores are concentrated in the richer parts of the country, such as California and several states in the Northeast. The pattern is reinforced by the heavy concentration of Starbucks in the major cities within each state, as opposed to an even spread across the state’s counties. Looking at the map, one can easily identify major cities such as New York, Los Angeles, Houston, Miami and Boston as Starbucks hotspots.

With strong evidence of high spatial concentration of Starbucks stores and an apparent correlation between income and number of stores, we wanted to understand whether this gave rise to any patterns in sentiment towards Starbucks. Specifically, we hypothesized that areas with more Starbucks stores or higher income would have a more positive sentiment towards the brand. In the following section, we test this hypothesis using text-based sentiment analysis of recent tweets that mention Starbucks.
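
To make the income/store-count relationship concrete, the following is a minimal sketch (in Python, not the code used for the analysis) of how stores can be counted per state and compared against income with a Pearson correlation. The state labels, store records, and income figures are all made up for illustration, and `pearson` is a helper defined here rather than part of any dataset or package mentioned above.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical store records: one (store_id, state) pair per store.
stores = [
    ("s1", "state_a"), ("s2", "state_a"), ("s3", "state_a"), ("s4", "state_a"),
    ("s5", "state_b"), ("s6", "state_b"), ("s7", "state_b"),
    ("s8", "state_c"),
    ("s9", "state_d"),
]

# Hypothetical average income per state (illustrative numbers only).
avg_income = {"state_a": 75000, "state_b": 72000, "state_c": 58000, "state_d": 45000}

# Count how many stores fall in each state.
stores_per_state = Counter(state for _, state in stores)

# Align the two variables on the same state order before correlating.
states = sorted(avg_income)
counts = [stores_per_state.get(s, 0) for s in states]
incomes = [avg_income[s] for s in states]

r = pearson(counts, incomes)
print(f"Pearson r between store count and income: {r:.2f}")
```

A coefficient near 1 would indicate that richer states tend to have more stores; with real data, store counts would normally be divided by population first so that large states do not dominate.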

Consumer Sentiments Towards Starbucks Across the US

After using the rtweet package to scrape the text and location of the latest 18,000 tweets mentioning Starbucks, we calculated the proportion of negative and positive words (as defined by the NRC lexicon) in the tweets from each state.
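
The calculation can be sketched roughly as follows. This is an illustration in Python, not the code used for the analysis: the two word sets are tiny stand-ins for the NRC lexicon, the tweets are invented, and the choice of denominator (all words in a state's tweets, rather than only sentiment-bearing words) is an assumption, since the text does not specify it.

```python
from collections import defaultdict

# Tiny stand-in word lists; the analysis itself relied on the NRC lexicon.
POSITIVE = {"love", "great", "delicious"}
NEGATIVE = {"hate", "slow", "burnt"}

# Hypothetical (state, tweet text) pairs.
tweets = [
    ("state_a", "I love my Starbucks latte"),
    ("state_a", "The line was slow and the coffee burnt"),
    ("state_b", "Great service, delicious drink"),
]

def sentiment_proportions(tweets):
    """Return {state: (positive share, negative share)} over all words tweeted."""
    totals = defaultdict(lambda: {"words": 0, "pos": 0, "neg": 0})
    for state, text in tweets:
        for raw in text.lower().split():
            word = raw.strip(".,!?\"'")  # drop surrounding punctuation
            if not word:
                continue
            tally = totals[state]
            tally["words"] += 1
            if word in POSITIVE:
                tally["pos"] += 1
            if word in NEGATIVE:
                tally["neg"] += 1
    return {
        state: (t["pos"] / t["words"], t["neg"] / t["words"])
        for state, t in totals.items()
    }

props = sentiment_proportions(tweets)
for state, (pos, neg) in sorted(props.items()):
    print(state, round(pos, 3), round(neg, 3))
```

Normalizing by word count keeps states with many tweets comparable to states with few, which matters here because tweet volume is itself likely to track population.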

As mentioned previously, our initial hypothesis was that states with higher densities of Starbucks stores would tweet more positive and fewer negative opinions. However, the plots below show the opposite: states with more Starbucks stores, such as California, Texas, Florida, and New York, all have a lower proportion of positive words and a higher proportion of negative words in their tweets than other states in the US. Conversely, states with fewer Starbucks stores, such as Wisconsin, Iowa, Kansas, and Mississippi, have a higher proportion of positive words and a lower proportion of negative words. Based on consumer sentiment alone, this suggests that it may be more beneficial for Starbucks to build new stores in states where it is not yet heavily established.

Clustering by State

Motivated by the potential that Starbucks might want to expand into areas it has not heavily invested in, we sought to understand how we could segment states into categories in order to determine how Starbucks might strategize differently in each of them.

State-Level Factors

We thus applied a k-means clustering algorithm to US states, using the following variables to categorize them:

  • Average Proportion of Negative Sentiment in Tweets
  • Average Income
  • Average Unemployment Rate
  • Number of Starbucks
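
Because k-means groups observations by Euclidean distance, variables on very different scales (income in the tens of thousands versus a proportion between 0 and 1) can let a single variable dominate the result. The text does not say whether the variables were rescaled before clustering; the Python sketch below shows one common option, z-score standardization, purely as an assumption. The feature rows are invented and `standardize` is a helper defined here.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

# Hypothetical rows: (negative-word share, avg. income, unemployment rate, stores).
features = [
    [0.10, 70000.0, 5.0, 2800.0],
    [0.05, 50000.0, 4.0, 90.0],
    [0.20, 60000.0, 3.0, 10.0],
]

scaled = standardize(features)
for row in scaled:
    print([round(value, 2) for value in row])
```

After this step every column has mean 0 and standard deviation 1, so each of the four variables contributes on a comparable footing to the distances k-means computes.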

State-Level Clusters

Using the elbow plot below, we found that 5 clusters offered the best trade-off between additional explanatory power and interpretability.
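
For readers unfamiliar with elbow plots, the quantity usually plotted is the total within-cluster sum of squares (WSS) for each candidate number of clusters; the "elbow" is the point after which adding clusters yields little further reduction. The following is a small, pure-Python sketch of that computation using Lloyd's algorithm. It is not the code used for the analysis, the points are invented, and `kmeans` and `total_wss` are helpers defined here.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # k distinct start points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

def total_wss(points, centers, labels):
    """Total within-cluster sum of squares for a given clustering."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical 2-D points forming three loose groups.
points = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.2),
    (9.0, 0.0), (9.2, 0.1), (8.9, -0.1),
]

wss_by_k = {}
for k in range(1, 6):
    centers, labels = kmeans(points, k)
    wss_by_k[k] = total_wss(points, centers, labels)
    print(k, round(wss_by_k[k], 3))
```

Plotting `wss_by_k` against `k` gives the elbow curve. Because k-means can settle in a local optimum, it is common to rerun it from several random starts for each `k` and keep the lowest WSS.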

Visualization of Clusters

The results of the clustering algorithm are displayed in the visualization below:

Cluster Characteristics

The matrix below shows how each cluster is distributed across the four variables. Cluster 1, which is composed solely of California, is characterized by having the highest number of Starbucks stores by a large margin (a total of 2,821!). It’s also interesting to note that California has both a high average income and a high unemployment rate, which is unusual since the two are inversely correlated in other states. Cluster 2 represents the states with lower income than most but also relatively low unemployment, a special case of the less-wealthy states. Cluster 3, composed solely of Vermont, is characterized by having the highest negative consumer sentiment. Cluster 4 represents the wealthiest states in the country, with a high average income and a mid-range unemployment rate. Finally, Cluster 5 represents the most impoverished states in the US, with a high unemployment rate, a low average income and a low number of Starbucks stores.

Final List of States

Having segmented states by their characteristics, we decided to pick one from each cluster in order to determine what a socially-minded strategy for Starbucks within that group of states might look like. To do so, we picked the centermost state within each cluster as the state “most representative” of that cluster. For California and Vermont, this was trivial, since they were the only states within their respective clusters. For the other three clusters, we calculated the distance from each state to its cluster center and picked the state with the minimum distance. From this analysis, we concluded that the final five states, each representative of its own cluster, are:

  • Virginia
  • Vermont
  • Tennessee
  • Iowa
  • California
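
The "centermost state" selection described above can be sketched as follows. This is a Python illustration rather than the code used for the analysis: the state labels and feature vectors are invented, Euclidean distance is assumed as the distance measure, and `centermost` is a helper defined here.

```python
import math

def centermost(members, center):
    """Return the name of the member closest (Euclidean distance) to the center."""
    return min(members, key=lambda name: math.dist(members[name], center))

# Hypothetical feature vectors for the states in one cluster.
cluster = {
    "state_a": (0.9, 1.1),
    "state_b": (0.1, -0.2),
    "state_c": (-1.0, -0.9),
}

# The cluster center is the coordinate-wise mean of its members.
center = tuple(sum(col) / len(cluster) for col in zip(*cluster.values()))

representative = centermost(cluster, center)
print(representative)
```

For a single-member cluster the function simply returns that member, which matches the trivial cases of California and Vermont.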

Clustering by County

County-Level Factors

  • Average Income
  • Average Unemployment Rate
  • Number of Starbucks

Step-by-Step Process

  1. Cluster Analysis
  2. Create clusters based on the appropriate number of centers
  3. Determine the most suitable cluster of counties

Determining the Number of Clusters for Each State

Virginia

Cluster characteristics, explain why we chose cluster 4

Vermont

Cluster Analysis

Cluster characteristics, explain why we chose cluster 1

Tennessee

Cluster Analysis

Visualizing clusters

Cluster characteristics, explain why we chose cluster 4

Iowa

Cluster Analysis

Visualizing clusters

Cluster characteristics, explain why we chose cluster 5

California

Cluster Analysis

Visualizing clusters

Cluster characteristics, explain why we chose cluster 1

Final Recommendations

Summarize overall process and thinking and impact of starbucks. Show the data table of all important counties from clusters we chose:

County         Avg. Income  Num. Starbucks  Avg. Unemployment Rate
Fresno            55523.57              36               12.326804
Glenn             53148.00               0               11.998625
Kern              49580.61              39               11.487972
Madera            49355.76               5               11.890378
Merced            49976.11               7               13.486942
Modoc             43340.33               0               10.394502
Monterey          68913.00               1                9.826804
Plumas            64924.83               0               11.346392
San Joaquin       71281.69              24               10.629553
Santa Cruz        82669.83               7               11.217886
Shasta            53369.18              10                9.644330
Stanislaus        48900.86               7               11.919244
Tehama            48808.75               1                9.508935
Tulare            45336.69              17               13.573540
Yuba              55171.50               0               12.325086
Adams             65864.50               0                6.569547
Appanoose         49003.50               0                5.678395
Benton            57531.00               0                6.154527
Buchanan          63804.00               0                7.088080
Butler            61836.00               0                5.995722
Calhoun           43333.00               0                7.597668
Carroll           55611.00               0                5.938277
Cherokee          55162.00               0                6.286043
Clarke            43619.50               0                7.021777
Clay              46965.00               0                6.565042
Clayton           54832.50               0                5.518210
Clinton           59001.00               0                6.144743
Crawford          51775.50               0                6.761908
Dallas            78098.67               0                7.271241
Delaware          55472.17               0                5.291101
Fayette           55887.25               0                6.797021
Franklin          50078.00               0                5.943186
Greene            58107.83               0                6.827746
Grundy            64870.50               0                6.497376
Hardin            58296.00               0                6.925958
Henry             51271.00               0                6.525775
Jackson           47535.00               0                6.540742
Jasper            59793.11               1                6.459512
Jefferson         56939.00               0                6.596809
Linn              73450.25               9                7.200540
Lyon              54713.00               0                5.473609
Marion            47388.00               0                7.014973
Marshall          60222.00               0                6.233670
Osceola           49948.00               0                5.801299
Pocahontas        50181.50               0                6.703086
Union             43619.50               0                6.206117
Warren            60983.96               7                5.924982
Washington        49947.33               0                5.938779
Webster           54712.17               0                6.443786
Chittenden        83283.50               0                3.387654
Brunswick         35873.00               0                7.469872
Cumberland        46083.00               0                6.302316
Grayson           32076.00               0                7.063253
Greensville       33192.67               0                5.819667
Halifax           29411.00               0                8.662019
Lee               31678.50               0                7.173886
Northampton       39718.25               0                6.739979
Prince Edward     54858.00               0                6.603667
Scott             46810.00               0                5.713129
Smyth             39226.00               0                8.147333
Carter            41428.75               0                6.840901
Chester           24221.00               0                7.115124
Fentress          39775.00               0                9.083333
Hardeman          26503.00               0                7.497459
Lawrence          45854.50               0                7.462405
Lewis             41536.00               0                8.101558
Macon             40533.00               0                7.106605
Monroe            48462.00               0                6.983337
Obion             45863.00               0                7.635802
Rhea              38409.50               0                8.395988
Unicoi            47091.25               0                8.077778
Van Buren         41919.00               0                7.148339
Weakley           40819.00               0                6.972840

Limitations and Conclusion

Paragraph here

Citations

List of important packages/data sets